Deriving Effectiveness Measures for Data Quality Rules

Authors

  • Lei Jiang
  • Alex Borgida
  • Daniele Barone
  • John Mylopoulos
Abstract

Poor data quality is a major concern worldwide and an obstacle to data integration and analysis efforts. Detecting errors and inconsistencies using application-specific data quality rules plays an important role in data quality assessment. These rules differ in efficacy and cost under different circumstances. In our previous work, we proposed a quantitative framework for measuring and comparing data quality rules in terms of their effectiveness. Effectiveness formulas are built from variables that represent probabilistic assumptions about the occurrence of errors in data values, and our earlier work gave examples of how to derive these formulas in an ad hoc fashion. This paper lays the foundations of a workbench approach for systematically deriving effectiveness formulas. The approach involves several steps: building Bayesian network graphs, adding (symbolic) probabilities to the nodes in the graph, and deriving effectiveness formulas. The graphs are built algorithmically for a large and useful class of data quality rules. We present this approach and its implementation in Python, and report evaluation results showing that the resulting formulas give reasonable estimates of effectiveness scores under various scenarios.
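
The three steps named in the abstract (graph, node probabilities, formula) can be illustrated with a deliberately tiny sketch. It is not the authors' workbench: the two-node network (a value is erroneous, then a rule flags it), the parameter names, and the use of precision and recall as effectiveness scores are all assumptions made for illustration, and exact fractions stand in for the symbolic probabilities the abstract mentions.

```python
from fractions import Fraction
from itertools import product


def joint(p_err, p_flag_given_err, p_flag_given_ok):
    """Joint distribution of a hypothetical two-node network E -> V.

    E: the data value is erroneous; V: the quality rule flags the value.
    All three parameters are illustrative assumptions, not taken from the paper.
    """
    table = {}
    for err, flag in product([True, False], repeat=2):
        p_e = p_err if err else 1 - p_err
        p_v = p_flag_given_err if err else p_flag_given_ok
        table[(err, flag)] = p_e * (p_v if flag else 1 - p_v)
    return table


def effectiveness(p_err, p_flag_given_err, p_flag_given_ok):
    """Return (precision, recall) of the rule, read off the joint distribution.

    Assumes the rule flags at least some values and that errors can occur,
    so neither denominator is zero.
    """
    t = joint(p_err, p_flag_given_err, p_flag_given_ok)
    tp = t[(True, True)]    # erroneous value, flagged
    fp = t[(False, True)]   # correct value, flagged
    fn = t[(True, False)]   # erroneous value, missed
    return tp / (tp + fp), tp / (tp + fn)


# Assumed scenario: 10% of values are wrong, the rule catches 90% of them,
# and it wrongly flags 5% of the correct values.
precision, recall = effectiveness(Fraction(1, 10), Fraction(9, 10), Fraction(1, 20))
print(precision, recall)  # 2/3 9/10
```

Enumerating the joint table keeps the sketch transparent for two nodes; a network built for a realistic rule would have more nodes and would need a proper inference or symbolic algebra step instead.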

Similar resources

The Impact of Governance Quality on the Migration of Graduates

This article examines the impact of governance quality on the migration of highly educated persons. Using human capital and social transition theories of migration, two hypotheses are proposed. The cross-sectional data (for 1990 and 2000) have been used to test the hypotheses via the framework of a random utility model. Principal Component Analysis (PCA) is used to build a composite index of go...

Measuring the Effectiveness of Explicit and Implicit Instruction through Explicit and Implicit Measures

Many studies have examined the effect of different approaches to teaching grammar including explicit and implicit instruction. However, research in this area is limited in a number of respects. One such limitation pertains to the issue of construct validity of the measures, i.e. the knowledge developed through implicit instruction has been measured through instruments which favor th...

Association Rules Extraction using Multi-objective Feature of Genetic Algorithm

Association Rule Mining is one of the most well liked techniques of data mining strategies whose primary aim is to extract associations among sets of items or products in transactional databases. However, mining association rules typically ends up in a really large amount of found rules, leaving the database analyst with the task to go through all the association rules and find out the interest...

Explanations of Empirically Derived Reactive Plans

Given an adequate simulation model of the task environment and payoff function that measures the quality of partially successful plans, competition-based heuristics such as genetic algorithms can develop high performance reactive rules for interesting sequential decision tasks. We have previously described an implemented system, called SAMUEL, for learning reactive plans and have shown that the...

Publication date: 2010